Introduction

When I was first tasked with performing a data analysis related to this position, I had all sorts of big ideas on how I would proceed. In my mind, I would have two or three relational data sets, where I could perform a few different joins, clean things up, summarize the important columns, and look for insights. What I found was: example relational data sets are hard to come by!

Instead, I found three real-world examples of government employee data from Bloomington, Chicago, and Marin. The data are not relational, and they have different columns, but they are real data sets with all of the problems that come from real data. In this README, I will talk about the data: how I learned about it, cleaned it, and summarized it. At the end, there is also a brief statistical analysis of salaries from Chicago.

R Packages Used

First, the R Packages used in this analysis are:
* Tidyverse - tibbles, dplyr, read_csv
* DT - nice, packaged data tables
* Plotly - the final, interactive graph uses ggplotly

The Data

I found three government workforce data sets online from different public websites and downloaded them in CSV format. You can view these data sets for yourself in my GitHub repo. The data were read in via the read_csv() function from the readr package (part of the tidyverse), which loads them into tibble data frames, making dplyr queries natural.
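As a minimal sketch, the loading step looks like this (the file names are assumptions for illustration; the actual files live in the repo):

```r
library(tidyverse)  # loads readr, dplyr, ggplot2, tidyr, etc.

# File names are assumed for this example
bloomington <- read_csv("bloomington.csv")
chicago     <- read_csv("chicago.csv")
marin       <- read_csv("marin.csv")
```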

Data Issues

Columns

Immediately after loading the data into R, I began to summarize the data and look for issues. The first thing that popped out was that each data set used a different naming convention, and some column names contained spaces. I converted those to a more predictable and useful naming style during the import into R, and set the appropriate text columns to factors.
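A sketch of that renaming step, using hypothetical original column names (the real ones differ per city):

```r
# Original column names are assumed for illustration; backticks
# handle the names that contain spaces
bloomington <- bloomington %>%
  rename(
    name       = `Employee Name`,
    department = `Department`,
    salary     = `Total Salary`
  ) %>%
  mutate(department = as.factor(department))  # text column -> factor
```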

Duplicate Names

The next obvious issue was that each city’s data had duplicate names or individuals. With Bloomington and Chicago, I would assume this occurred due to positional changes, promotions, or salary adjustments. With Marin, duplicates arose because there were multiple years of data instead of just one.

Using dplyr, I grouped the data to take maximums for salary for Bloomington and Chicago, and I filtered to a single year for Marin. An example of each in SQL would be:

Sample SQL to check for duplicate names

```sql
select names, count(*) as num_rows
from bloomington
group by names
having count(*) > 1
```

Sample SQL to remove duplicate names

```sql
select names, departments, max(salaries) as salaries
from bloomington
group by names, departments
```
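The dplyr versions of the same two steps might look like this (column names and the Marin year are assumptions for illustration):

```r
# Check for duplicate names
bloomington %>%
  group_by(name) %>%
  summarise(num_rows = n()) %>%
  filter(num_rows > 1)

# Remove duplicates by keeping the max salary per person/department
bloomington <- bloomington %>%
  group_by(name, department) %>%
  summarise(salary = max(salary), .groups = "drop")

# Marin: filter to a single year instead (year value assumed)
marin <- marin %>% filter(year == 2020)
```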

Salaries

Bloomington and Chicago both had salary information, though Marin did not. When looking at Bloomington, for instance, it was apparent that part-time jobs were mixed in with full-time jobs, which made parsing through the data more difficult. For example, looking below, the ‘Public Board of Safety’ salary is only $634, and ‘Public Works’ salaries range from $2k to over $100k.

Looking at a density function of the salary data, it appears bi-modal, which I think makes sense considering there are two types of salaries in the data. With no better way to identify which people are part time, cutting off at around the 25% quantile mark seems like a reasonable place to start. This is coincidentally right near the $25k mark.

The area to the left of that mark, in the shaded region below, will be labeled as part time in the data, and all else will be labeled full time.
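A sketch of that cutoff-and-label step in R (column names assumed; the plot here draws a simple dashed line at the cutoff rather than a shaded region):

```r
# Cutoff at roughly the 25% quantile of salary (~$25k)
cutoff <- quantile(bloomington$salary, 0.25, na.rm = TRUE)

# Label everything below the cutoff as part time
bloomington <- bloomington %>%
  mutate(status = if_else(salary < cutoff, "Part Time", "Full Time"))

# Density curve with the cutoff marked
ggplot(bloomington, aes(salary)) +
  geom_density() +
  geom_vline(xintercept = cutoff, linetype = "dashed")
```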

Once that change was made, we can see there are far fewer discrepancies in the quantiles, the mean and median are closer, and when they differ they are right-tailed (due to higher salary ranges). To be more accurate, aside from having better quality data, these quantiles and analyses could have been computed on a department-by-department basis, but I think for brevity the $25k cutoff was reasonable.

Chicago, on the other hand, does have a full-time vs part-time flag. The only issue worth considering in that data was that the salary and hourly-pay columns were separate. I combined them into one, taking the salary value when the person was salaried, and filling in the rest with hours * rate * 52 weeks.
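That combination step can be sketched as follows (the column names and the "Salary" flag value are assumptions based on the description above):

```r
# Salaried workers keep their annual salary; hourly workers are
# annualized as hours * rate * 52 weeks
chicago <- chicago %>%
  mutate(annual_pay = if_else(
    salary_or_hourly == "Salary",
    annual_salary,
    typical_hours * hourly_rate * 52
  ))
```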

Looking below, we can see Chicago’s density curve looks much more normal and less bi-modal than the Bloomington data did. Nothing much else jumps out, other than the obvious fact that salary ranges are much larger in Chicago than in Bloomington.

With the Marin data, the lack of salary information was unfortunate. Beyond the duplicate names caused by multiple years, not much stood out as an issue when summarizing the data. Instead, I decided to break each department out by gender. If we wanted to do an analysis on diversity and inclusion, we might want to check that our department numbers are in line with, or better than, neighboring cities.
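A sketch of the department-by-gender breakdown (column names assumed):

```r
# Count employees per department and gender, then widen so each
# gender becomes its own column
marin %>%
  count(department, gender) %>%
  pivot_wider(names_from = gender, values_from = n, values_fill = 0)
```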

Chicago Salary Analysis

Lastly, I wanted to pick one of the cities to dive deeper into, and Chicago had the most accurate and usable information. After some deliberation and advisement, I chose to see if it was possible to find some outlier salaries and present them in an interesting way. To do this, I created a simple linear model that predicted salary based solely on department, and then created 95% prediction intervals.
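The model-and-intervals step can be sketched like this (the annual_pay column name is an assumption):

```r
# Simple linear model: salary predicted by department alone
fit <- lm(annual_pay ~ department, data = chicago)

# 95% prediction interval for each row; predict() returns a matrix
# with columns fit, lwr, and upr
preds <- predict(fit, newdata = chicago,
                 interval = "prediction", level = 0.95)

chicago <- chicago %>%
  mutate(
    lwr     = preds[, "lwr"],
    upr     = preds[, "upr"],
    outlier = annual_pay < lwr | annual_pay > upr
  )
```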

The below chart gives the user the ability to look at each department in a jittered scatter plot, with 95% prediction intervals, and a tooltip that displays the name of the person who corresponds with the point on the graph. I suppose for someone familiar with Chicago politics, looking at outliers on the upper end, you would expect to recognize the names of those individuals who are far out-earning their peers.
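A sketch of how such a chart can be built with ggplotly, assuming prediction-interval columns (lwr, upr) have already been joined onto the data and column names as above:

```r
library(plotly)

# One interval per department for the error bars
intervals <- chicago %>%
  distinct(department, lwr, upr)

# The text aesthetic is unused by ggplot itself (it will warn),
# but ggplotly picks it up for the tooltip
p <- ggplot() +
  geom_jitter(data = chicago,
              aes(x = department, y = annual_pay, text = name),
              width = 0.2, alpha = 0.4) +
  geom_errorbar(data = intervals,
                aes(x = department, ymin = lwr, ymax = upr),
                width = 0.4, colour = "red")

ggplotly(p, tooltip = "text")
```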

Conclusions

Finding good and complete data to work with is not as easy as it seems, but at the same time, working with real, published city workforce data was interesting and rewarding. With more data and time, I could see a comparison of workforce sizes, budget, and diversity as obvious next steps.

Like most data sets, these had issues that had to be found and worked around. Whether the issue is a general lack of data to segment by, duplicates, strangely formatted columns, missing data, or incorrect data, there is nearly always work to be done and many tools to do it with.

For me, SQL has been one of the most powerful tools I have used, especially when working with large datasets. Packages like the tidyverse make this type of work a breeze as well, as does pandas in Python. If given the option, SAS works well too.

I hope I was able to showcase my ability to consume data I am not familiar with, learn about it, summarize it, clean it, analyze it, and present it in a relatively short amount of time.